Clustering by Compression 

Rudi Cilibrasi* Paul Vitanyi t 

CWI CWI and University of Amsterdam 



Abstract 

We present a new method for clustering based on compression. The method doesn't use subject-specific features 
or background knowledge, and works as follows: First, we determine a universal similarity distance, the normal- 
ized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise 
concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted 
to a specific application area, and works across application area boundaries. A theoretical precursor, the normal- 
ized information distance, co-developed by one of the authors, is provably optimal in the sense that it minorizes 
every computable normalized metric that satisfies a certain density requirement. However, the optimality comes at 
the price of using the non-computable notion of Kolmogorov complexity. We propose precise notions of similar- 
ity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that 
approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram 
(binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available 
as public software, and is robust under choice of different compressors. To substantiate our claims of universality 
and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, 
literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, 
using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major 
questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the 
Marsupionta hypothesis against the Theria hypothesis. 

1 Introduction 

All data are created equal but some data are more alike than others. We propose a method expressing this alikeness, 
using a new similarity metric based on compression. This metric doesn't use any features or background knowledge, 
and can without changes be applied to different areas and across area boundaries. It is robust in the sense that its 
success appears independent from the type of compressor used. The clustering we use is hierarchical clustering 
in dendrograms based on a new fast heuristic for the quartet method. The method is available as an open-source 
software tool. Below we explain the method, the theory underpinning it, and present evidence for its universality 
and robustness by experiments and results in a plethora of different areas using different types of compressors. 

Feature-Based Similarities: We are presented with unknown data and the question is to determine the similari- 
ties among them and group like with like together. Commonly, the data are of a certain type: music files, transaction 
records of ATM machines, credit card applications, genomic data. In these data there are hidden relations that we 
would like to get out in the open. For example, from genomic data one can extract letter- or block frequencies (the 
blocks are over the four-letter alphabet); from music files one can extract various specific numerical features, related 
to pitch, rhythm, harmony etc. One can extract such features using for instance Fourier transforms |41 1 or wavelet 
transforms [ 16 1. The feature vectors corresponding to the various files are then classified or clustered using existing 
classification software, based on various standard statistical pattern recognition classifiers |41|, Bayesian classi- 
fiers [ 13 1, hidden Markov models Kill, ensembles of nearest-neighbor classifiers [ 16 1 or neural networks 1131 1371 . 
For example, in music one feature would be to look for rhythm in the sense of beats per minute. One can make a 
histogram where each histogram bin corresponds to a particular tempo in beats-per-minute and the associated peak 
shows how frequent and strong that particular periodicity was over the entire piece. In |41 1 we see a gradual change 
from a few high peaks to many low and spread-out ones going from hip-hip, rock, jazz, to classical. One can use 
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this similarity type to try to cluster pieces in these categories. However, such a method requires specific and detailed 
knowledge of the problem area, since one needs to know what features to look for. 

Non-Feature Similarities: Our aim is to capture, in a single similarity metric, every effective metric: effective 
versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance |9|, and 
so on. This metric should be so general that it works in every domain: music, text, literature, programs, genomes, 
executables, natural language determination, equally and simultaneously. It would be able to simultaneously detect 
all similarities between pieces that other effective metrics can detect. 

Compression-based Similarity: Such a "universal" metric was co-developed by us in II27II28||2"9"1 . as a nor- 
malized version of the "information metric" of 1 30 4|. Roughly speaking, two objects are deemed close if we can 
significantly "compress" one given the information in the other, the idea being that if two pieces are more similar, 
then we can more succinctly describe one given the other. The mathematics used is based on Kolmogorov com- 
plexity theory 1 30 1 . In |29| we defined a new class of metrics, taking values in [0, 1] and appropriate for measuring 
effective similarity relations between sequences, say one type of similarity per metric, and vice versa. It was shown 
that an appropriately "normalized" information distance minorizes every metric in the class. It discovers all effective 
similarities in the sense that if two objects are close according to some effective similarity, then they are also close 
according to the normalized information distance. Put differently, the normalized information distance represents 
similarity according to the dominating shared feature between the two objects being compared. The normalized in- 
formation distance too is a metric and takes values in [0, 1]; hence it may be called "the" similarity metric. To apply 
this ideal precise mathematical theory in real life, we have to replace the use of the noncomputable Kolmogorov 
complexity by an approximation using a standard real-world compressor. Earlier approaches resulted in the first 
completely automatic construction of the phylogeny tree based on whole mitochondrial genomes, 1271 1281 l29l . a 
completely automatic construction of a language tree for over 50 Euro- Asian languages 1291 . detects plagiarism in 
student programming assignments |38 1, gives phylogeny of chain letters |5 1, and clusters music |8 1. Moreover, the 
method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv 
based dictionary (gzip), block based (bzip2), or special purpose (Gencompress). 

Related Work: In view of the simplicity and naturalness of our proposal, it is perhaps surprising that compres- 
sion based clustering and classification approaches did not arise before. But recently there have been several partially 
independent proposals in that direction: Q] [2) for building language trees — while citing 1 30 4 1 — is by essentially 
more ad hoc arguments about empirical Shannon entropy and Kullback-Leibler distance. This approach is used to 
cluster music MIDI files by Kohonen maps in 1311 . Another recent offshoot based on our work is 1211 hierarchi- 
cal clustering based on mutual information. In a related, but considerably simpler feature-based approach, one can 
compare the word frequencies in text files to assess similarity. In |40| the word frequencies of words common to 
a pair of text files are used as entries in two vectors, and the similarity of the two files is based on the distance 
between those vectors. The authors attribute authorship to Shakespeare plays, the Federalist Papers, and the Chinese 
classic "The Dream of the Red Chamber." The approach to similarity distances based on block occurrence statistics 
is standard in genomics, and in an experiment below it gives inferior phylogeny trees compared to our compression 
method (and wrong ones according to current biological wisdom). The possibly new feature in the cited work is that 
it uses statistics of only the words that the files being compared have in common. A related, opposite, approach was 
taken in 1 20 1, where literary texts are clustered by author gender or fact versus fiction, essentially by first identifying 
distinguishing features, like gender dependent word usage, and then classifying according to those features. 

Outline: Here we propose a first comprehensive theory of real-world compressor-based normalized compres- 
sion distance, a novel hierarchical clustering heuristic, together with many new applications. First, we define new 
mathematical notions of "similarity metric," "normal compressor," and "normalized compression distance." We then 
prove the normalized compression distance based on a normal compressor to be a similarity metric. The normalized 
compression distance is shown to be quasi-universal in the sense that it minorizes every computable similarity metric 
up to an additive error term that depends on the quality of the compressor's approximation of the true Kolmogorov 
complexities of the files concerned, and that under certain conditions vanishes with increasing file length. This 
means that the NCD captures the dominant similarity over all possible features of the objects compared, up to the 
stated precision. Next, we present a method of hierarchical clustering based on a novel fast randomized hill-climbing 
heuristic of a new quartet tree optimization criterion. Given a matrix of the pairwise similarity distances between the 
objects, we score how well the resulting tree represents the information in the distance matrix on a scale of to 1 . 
Then, as proof of principle, we run the program on three data sets, where we know what the final answer should be: 
(i) reconstruct a tree from a distance matrix obtained from a randomly generated tree; (ii) reconstruct a tree from files 
containing artificial similarities; and (iii) reconstruct a tree from natural files of vastly different types. To substantiate 
our claim of universality, we apply the method to different areas, not using any feature analysis at all. We first give 
an example in whole-genome phylogeny using the whole mitochondrial DNA of the species concerned. We compare 
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the hierarchical clustering of our method with a more standard method of two-dimensional clustering (to show that 
our dendrogram method of depicting the clusters is more informative). We give a whole-genome phylogeny of fungi 
and compare this to results using alignment of selected proteins (alignment being often too costly to perform on the 
whole-mitochondial genome, but the disadvantage of protein selection being that different selections usually result 
in different phylogenies — so which is right?). We identify the virii that are closest to the sequenced SARS virus; 
we give an example of clustering of language families; Russian authors in the original Russian, the same pieces in 
English translation (clustering partially follows the translators); clustering of music in MIDI format; clustering of 
handwritten digits used for optical character recognition; and clustering of radio observations of a mysterious astro- 
nomical object, a microquasar of extremely complex variability. In all these cases the method performs very well in 
the following sense: The method yields the phylogeny of 24 species precisely according to biological wisdom. The 
probability that it randomly would hit this one outcome, or anything reasonably close, is very small. In clustering 36 
music pieces taken equally many from pop, jazz, classic, so that 12-12-12 is the grouping we understand is correct, 
we can identify convex clusters so that only six errors are made. (That is, if three items get dislodged then six items 
get misplaced.) The probability that this happens by chance is extremely small. The reason why we think the method 
does something remarkable is concisely put by Laplace 1261 : 

"If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event 
as less possible than the others, but, since this event ought to be the effect of a regular cause or that 
of chance, the first of these suppositions is more probable than the second. On a table we see letters 
arranged in this order C onstantinopl e, and we judge that this arrangement is not the 
result of chance, not because it is less possible than others, for if this word were not employed in any 
language we would not suspect it came from any particular cause, but this word being in use among us, 
it is incomparably more probable that some person has thus arranged the aforesaid letters than that this 
arrangement is due to chance." 

Materials and Methods: The data samples we used were obtained from standard data bases accessible on the 
world-wide web, generated by ourselves, or obtained from research groups in the field of investigation. We supply 
the details with each experiment. The method of processing the data was the same in all experiments. First, we 
preprocessed the data samples to bring them in appropriate format: the genomic material over the four-letter alphabet 
{A, T, G, C} is recoded in a four-letter alphabet; the music MIDI files are stripped of identifying information such as 
composer and name of the music piece. Then, in all cases the data samples were completely automatically processed 
by our CompLearn Toolkit, rather than as is usual in phylogeny, by using an ecclectic set of software tools per 
experiment. Oblivious to the problem area concerned, simply using the distances according to the NCD below, the 
method described in this paper fully automatically classifies the objects concerned. The method has been released 
in the public domain as open-source software at http://complearn.sourceforge.net/ . The CompLearn Toolkit is a 
suite of simple utilities that one can use to apply compression techniques to the process of discovering and learning 
patterns in completely different domains. In fact, this method is so general that it requires no background knowledge 
about any particular subject area. There are no domain-specific parameters to set, and only a handful of general 
settings. 

The Complearn Toolkit using NCD and not, say, alignment, can cope with full genomes and other large data files 
and thus comes up with a single distance matrix. The clustering heuristic generates a tree with a certain confidence, 
called standardized benefit score or S(T) value in the sequel. Generating trees from the same distance matrix many 
times resulted in the same tree or almost the same tree, for all distance matrices we used, even though the heuristic 
is randomized. The differences that arose are apparently due to early or late termination with different S(T) values. 
This is a great difference with previous phylogeny methods, where because of computational limitations one uses 
only parts of the genome, or certain proteins that are viewed as significant 1191 . These are run through a tree 
reconstruction method like neighbor joining |36|, maximum likelihood, maximum evolution, maximum parsimony 
as in 1 19 1, or quartet hypercleaning |6|, many times. The percentage-wise agreement on certain branches arising 
are called "bootstrap values." Trees are depicted with the best bootstrap values on the branches that are viewed 
as supporting the theory tested. Different choices of proteins result in different best trees. One way to avoid this 
ambiguity is to use the full genome, 1 34- 29 1, leading to whole-genome phylogeny. With our method we do whole- 
genome phylogeny, and end up with a single overall best tree, not optimizing selected parts of it. 

The quality of the results depends on (a) the NCD distance matrix, and (b) how well the hierarchical tree rep- 
resents the information in the matrix. The quality of (b) is measured by the S(T) value, and is given with each 
experiment. In general, the S(T) value deteriorates for large sets. We believe this to be partially an artifact of a 
low-resolution NCD matrix due to limited compression power, and limited file size. The main reason, however, is 
the fact that with increasing size of a natural data set the projection of the information in the NCD matrix into a 
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binary tree gets increasingly distorted. Another aspect limiting the quality of the NCD matrix is more subtle. Recall 
that the method knows nothing about any of the areas we apply it to. It determines the dominant feature as seen 
through the NCD filter. The dominant feature of alikeness between two files may not correspond to our a priori 
conception but may have an unexpected cause. The results of our experiments suggest that this is not often the case: 
In the natural data sets where we have preconceptions of the outcome, for example that works by the same authors 
should cluster together, or music pieces by the same composers, musical genres, or genomes, the outcomes conform 
largely to our expectations. For example, in the music genre experiment the method would fail dramatically if genres 
were evenly mixed, or mixed with little bias. However, to the contrary, the separation in clusters is almost perfect. 
The few misplacements that are discernable are either errors (the method was not powerful enough to discern the 
dominant feature), or the dominant feature between a pair of music pieces is not the genre but some other aspect. 
The surprising news is that we can generally confirm expectations with few misplacements, indeed, that the data 
don't contain unknown rogue features that dominate to cause spurious (in our preconceived idea) clustering. This 
gives evidence that where the preconception is in doubt, like with phylogeny hypotheses, the clustering can give true 
support of one hypothesis against another one. 

Figures: We use two styles to display the hierarchical clusters. In the case of genomics of Eutherian orders 
and fungi, language trees, it is convenient to follow the dendrograms that are customary in that area (suggesting 
temporal evolution) for easy comparison with the literature. Although there is no temporal relation intended, the 
dendrogram representation looked also appropriate for the Russian writers, and translations of Russian writers. In 
the other experiments (even the genomic SARS experiment) it is more informative to display an unrooted ternary 
tree (or binary tree if we think about incoming and outgoing edges) with explicit internal nodes. This facilitates 
identification of clusters in terms of subtrees rooted at internal nodes or contiguous sets of subtrees rooted at branches 
of internal nodes. 

2 Similarity Metric 

In mathematics, different distances arise in all sorts of contexts, and one usually requires these to be a "metric". We 
give a precise formal meaning to the loose distance notion of "degree of similarity" used in the pattern recognition 
literature. 

Metric: Let £2 be a nonempty set and be the set of nonnegative real numbers. A metric on £2 is a function 
D : £2 x £2 — » + satisfying the metric (in)equalities: 

• D(x,y) = iff x = y, 

• D(x,y) = D(y,x) (symmetry), and 

• D(x,y) < D(x,z) + D(z,y) (triangle inequality). 

The value D(x,y) is called the distance between x,y £ £2. A familiar example of a metric is the Euclidean metric, 
the everyday distance e(a,b) between two objects a,b expressed in, say, meters. Clearly, this distance satisfies the 
properties e(a,a) = 0, e(a,b) = e(b,a), and e(a,b) < e(a,c) +e(c,b) (for instance, a = Amsterdam, b = Brussels, 
and c = Chicago.) We are interested in "similarity metrics". For example, if the objects are classical music pieces 
then the function D(a,b) = if a and b are by the same composer and D(a,b) = 1 otherwise, is a similarity metric. 
This metric captures only one similarity aspect (feature) of music pieces, presumably an important one because it 
subsumes a conglomerate of more elementary features. 

Density: In defining a class of acceptable metrics we want to exclude unrealistic metrics like f(x,y) = i for 
every pair i^y. We do this by restricting the number of objects within a given distance of an object. As in |0J we 
do this by only considering effective distances, as follows. Fix a suitable, and for the remainder of the paper, fixed, 
programming language. This is the reference programming language. 

Definition 2.1 Let £2 = £*, with E a finite nonempty alphabet and E* the set of finite strings over that alphabet. Note 
that for us "files" in computer memory are finite binary strings. A function D:£2x£2^2£, + isan acceptable metric 
if for every pair of objects x,y e £2 the distance D(x,y) is the length of a binary prefix code-word that is a program 
that computes x from y, and vice versa, in the reference programming language, and the metric (in)equalities hold 
up to (9(logn) where n is the maximal binary length of an element of £2 involved in the (in)equality concerned. 

Example 2.2 In representing the Hamming distance d between x and y strings of equal length n differing in positions 
ii,... , id, we can use a simple prefix-free encoding of (n,d,i\,. . . ,i<j) in H(x,y) = 21ogn + 41oglog« + 2 + <ilogn 
bits. We encode n and d prefix-free in logn + 21oglog« + 1 bits each, see e.g. [30|, and then the literal indexes of the 



4 



actual flipped-bit positions. Hence, H(x,y) is the length of a prefix code word (prefix program) to compute x from y 
and vice versa. Then, by the Kraft inequality, see 1 10 1, 

£ 2 -fffaO<i. (2.1) 



It is easy to verify that H is a metric in the sense that it satisfies the metric (in)equalities up to O(logn) additive 
precision. 

Normalization: Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than 
tiny objects that differ by the same amount. For example, two whole mitochondrial genomes of 18,000 bases that 
differ by 9,000 are very different, while two whole nuclear genomes of 3 x 10 9 bases that differ by only 9,000 bases 
are very similar. Thus, absolute difference between two objects doesn't govern similarity, but relative difference 
appears to do so. 

Definition 2.3 A compressor is a lossless encoder mapping E* into {0, 1}* such that the resulting code is a prefix 
code. For convenience of notation we identify "compressor" with a "code word length function" C : £* — > 3\£ , where 
5\£ is the set of nonnegative integers. The compressed version of a file x is denoted by x* and its length is C(x) — \x*\. 
We only consider compressors such that C(x) < \x\ + (9(log \x\. 

Since the compressor is a lossless encoder, if x ^ y then x* ^ y*. In the following we fix a compressor C, at this stage 
it doesn't matter which one. We call the fixed compressor the reference compressor. 

Definition 2.4 A normalized metric or similarity metric, relative to a reference compressor C, is a function d : 
£2 x £2 — > [0, 1] that for every constant e G [0, 1] satisfies the density constraint 

|{(*,y) : d(x,y) < e < l,C(y) < C(x)}\ < 2 ( 1 +«) c «+\ (2.2) 

and satisfies the metric (in)equalities up to additive precision O ( (log n)/n) where n is the maximal binary length of 
an element of £2 involved in the (in)equality concerned. 

This requirement follows from a "normalized" version of the Kraft inequality: 

Lemma 2.5 Let d : £2 x £2 — > [0, 1] satisfy 

£ 2 -(l+d(x,y))C(x) < L (2 3) 

Then, d satisfies \2.2\ . 

PROOF. For suppose the contrary: there is an e £ [0, 1], such that J2.2I is false. Then, starting from \23\ we 
obtain a contradiction: 

1 > £ 2 - ( - l+d(x ^ c ^'> > 2 - (1+e)cW > 2 { > l+e)c(x)+l 2- { - x+e]c(x) > 1. 

y^x y^tx&d (x,y) <e<l &C(y) <C(x) 

□ 

We call a normalized metric a "similarity" metric, because it gives a relative similarity according to the metric 
distance (with distance when objects are maximally similar and distance 1 when the are maximally dissimilar) and, 
conversely, for every well-defined computable notion of similarity we can express it as a metric distance according to 
our definition. In the literature a distance that expresses lack of similarity (like ours) is often called a "dissimilarity" 
distance or a "disparity" distance. 

Remark 2.6 As far as the authors know, the idea of normalized metric is, surprisingly, not well-studied. An excep- 
tion is 1 39 1, which investigates normalized metrics to account for relative distances rather than absolute ones, and 
it does so for much the same reasons as in the present work. An example there is the normalized Euclidean metric 
\x — y\/(\x\ + \y\), where x,y S mj 1 denotes the real numbers) and | • | is the Euclidean metric — the norm. 
Another example is a normalized symmetric-set-difference metric. But these normalized metrics are not necessarily 
effective in that the distance between two objects gives the length of an effective description to go from either object 
to the other one. 
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Example 2.7 The prefix-code for the Hamming distance H(x,y) between x,y E {0, 1 }" in Example |2.2l is a program 
to compute from x toy and vice versa. It is metric up to <9(logn) additive precision. To turn it into a similarity metric 
define h{x,y) = H(x,y)/C„, where C„ — max{C(x) : |jc] = n). Trivially, < h(x,y) < 1. The metric properties of 
H(x,y) that held up to additive precision (9(logn) are preserved under division by C„ with the precision improved 
to 0((log«)/«). Since C{x)/C n < 1 we have 2^ M=H=B 2-*M C M < 1. Then, E y ^,| y |= W ^ 1 2-( 1+ *( J «'» c W < 
Yi\x\=n'^~ C — 1» where the last inequality is the Kraft inequality since C(x) is a prefix code word length. Hence 
h(x,y) satisfies i2.3\ and therefore 12.21 by Lemma l2~5l (} 



3 Normalized Compression Distance 

In 1 4 29 1 the conditions and definitions on a similarity metric are the particular instance where the compressor C is 
instantiated as the ultimate compressor K such that K(x) is the Kolmogorov complexity 1 30 1 of x. In 1 29 1, it is shown 
that one can represent the entire wide class of the similarity metrics (the class based on the ultimate compressor K) 
by a single representative: the "normalized information distance" is a metric, and it is universal in the sense that 
this single metric uncovers all similarities simultaneously that the various metrics in the class uncover separately. 
This should be understood in the sense that if two files (of whatever type) are similar (that is, close) according to 
the particular feature described by a particular metric, then they are also similar (that is, close) in the sense of the 
normalized information metric. This justifies calling the latter the similarity metric. However, this metric is based on 
the notion of Kolmogorov complexity. The Kolmogorov complexity of a file is essentially the length of the ultimate 
compressed version of the file. Unfortunately, the Kolmogorov complexity of a file is non-computable in the Turing 
sense. In applying the approach, we have to make do with an approximation based on a far less powerful real-world 
reference compressor C. The resulting applied approximation of the "normalized information distance" is called the 
normalized compression distance (NCD) and is defined by 

Ajrnl \ C(xy)-mm{C(x),C(y)} nn 

NCD{x,y) = . (3.1) 

max{C(x),C(y)} 

Here, C(xy) denotes the compressed size of the concatenation of x and y, C(x) denotes the compressed size of x, and 
C(y) denotes the compressed size of y. The NCD is a non-negative number < r < 1 + £ representing how different 
the two files are. Smaller numbers represent more similar files. The e in the upper bound is due to imperfections in 
our compression techniques, but for most standard compression algorithms one is unlikely to see an e above 0. 1 (in 
our experiments gzip and bzip2 achieved NCD's above 1, but PPMZ always had NCD at most 1). 

Remark 3.1 Technically, the Kolmogorov complexity of x given y is the length of the shortest binary program that on 
input}' outputs x; it is denoted as K(x\y). For precise definitions, theory and applications, see |30|. The Kolmogorov 
complexity of x is the length of the shortest binary program with no input that outputs x; it is denoted as K(x) = 
K(x\e) where e denotes the empty input. The similarity metric in |29| is max{K(x\y),K(y\x)} / max{K(x),K(y)}. 
Approximation of the denominator by a given compressor is straightforward by max{C(x),C(y)}. The numerator is 
more tricky. It can be written as 

mi*{K(x,y)-K{x),K(x,y)-K(y)}, (3.2) 



within logarithmic additive precision by the additive property of Kolmogorov complexity |30|. The term K(x,y) 
represents the length of the shortest program for the pair (x,y). In compression practice it is easier to deal with the 
concatenation xy or yx. Again, within logarithmic precision K(x,y) = K(xy) = K(yx). But we have to deal with a 
real-life compressor here, for which C(xy) may be different from C(yx). Clearly, however, the smaller of the two will 
be the closest approximation to K(x,y). Therefore, following a suggestion by Steven de Rooij, one can approximate 
( 13.21 best by min{C(xy),C(yx)} — min{C(x),C(y)}. Here, and in the CompLearn Toolkit, however, we simply use 
C(xy) rather than min{C(xy),C(yx)}. This is justified by the observation that block-coding based compressors are 
symmetric almost by definition, and experiments with various stream-based compressors (gzip, PPMZ) show only 
small deviations from symmetry. In our definition of a "normal" compressor below we put symmetry as one of the 
basic properties. 

The theory as developed for the Kolmogorov-complexity based "normalized information distance" in 1 29 1 does not 
hold directly for the (possibly poorly) approximating NCD. Below, we develop the theory of NCD based on the 
notion of a "normal compressor," and show that the NCD is a (quasi-) universal similarity metric relative to a normal 
reference compressor C. The theory developed in 11291 is the boundary case C = K, where the "quasi-universality" 
below has become full "universality". 
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3.1 Normal Compressor 



The properties of the NCD defined below depend on the details of the compressor used. However, under mild and 
natural assumptions on the compressor one can show the NCD is a (quasi-)universal similarity metric. 

Definition 3.2 A compressor C is normal if it satisfies, up to an additive (9(logn) term, with n the maximal binary 
length of an element of Q. involved in the (in)equality concerned, the following: 

1. Idempotency: C(xx) = C(x). 

2. Monotonicity: C(xy) > C(x). 

3. Symmetry: C(xy) = C(yx). 

4. Distributivity: C(xy)+C(z) < C(xz)+C(yz). 

Idempotency: A reasonable compressor will see exact repetitions and obey idempotency up to the required 
precision. 

Monotonicity: A real compressor must have the monotonicity property, at least up to the required precision. 
The property is evident for stream-based compressors, and only slightly less evident for block-coding compressors. 

Symmetry: Stream-based compressors of the Lempel-Ziv family, like gzip and pkzip, and the predictive PPM 
family, like PPMZ, are possibly not precisely symmetric. This is related to the stream-based property: the initial file x 
may have regularities to which the compressor adapts; after crossing the border to y it must unlearn those regularities 
and adapt to the ones of x. This process may cause some imprecision in symmetry that vanishes asymptotically with 
the length of x,y. A compressor must be poor indeed (and will certainly not be used to any extent) if it doesn't 
satisfy symmetry up to the required precision. Apart from stream-based, the other major family of compressors is 
block-coding based, like bzip2. They essentially analyze the full input block by considering all rotations in obtaining 
the compressed version. It is to a great extent symmetrical, and real experiments show no departure from symmetry. 

Distributivity: The distributivity property is not immediately intuitive. In Kolmogorov complexity theory the 
stronger distributivity property 

C(xyz)+C(z) <C(xz)+C(yz) (3.3) 
holds (with K = C). However, to prove the desired properties of NCD below, only the weaker distributivity property 

C(xy) + C(z) < C(xz) + C(yz) (3.4) 

above is required, also for the boundary case were C = K. In practice, real-world compressors appear to satisfy this 
weaker distributivity property up to the required precision. 

Definition 3.3 Define 

C(y\x)=C(xy)-C(x). (3.5) 

This number C(y\x) of bits of information in y, relative to x, can be viewed as the excess number of bits in (xy)* 
compared to x* , and is called the amount of conditional compressed information. 

In the definition of compressor the decompression algorithm is not included (unlike the case of Kolmorogov com- 
plexity, where the decompressing algorithm is given by definition), but it is easy to construct one: Given the file x* 
in C(x) bits, we can run the compressor on all candidate strings z — for example, in length-increasing lexicographical 
order, until we find the compressed string zo = x. Since this string decompresses to x we have found x = zo- Given 
the file (xy)* in C(xy) bits we repeat this process using strings xz until we find (xzi)* = {xy)*- Since this string 
decompresses to xy, we have found y = zi- By the unique decompression property we find that C(y|x) is the extra 
number of bits we require to describe y apart from describing x. It is intuitively acceptable that the conditional 
compressed information C(x|y) satisfies the triangle inequality 

C(x\y)<C(x\z)+C(z\y). (3.6) 

Lemma 3.4 Both dl3l and imply d3~4l 
Proof. ( (I3.3> implies d3.4> :) By monotonicity. 

C J3.6i implies (13. 4> :) Rewrite the terms in (13.61 according to (13. 5> . cancel C(y) in the left- and right-hand sides, 
use symmetry, and rearrange. □ 
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Lemma 3.5 A normal compressor satisfies additionally subadditivity: C(xy) < C(x) + C{y). 

PROOF. Consider the special case of distributivity with z the empty word so that xz ~x,yz — y, and C(z) = 0. □ 

Subadditivity: The subadditivity property is clearly also required for every viable compressor, since a compres- 
sor may use information acquired from x to compress y. Minor imprecision may arise from the unlearning effect of 
crossing the border between x and y, mentioned in relation to symmetry, but again this must vanish asymptotically 
with increasing length of x,y. 



3.2 Properties of the NCD : 

There is a natural interpretation to NCD(x,y): If, say, C(y) > C(x) then we can rewrite 

C(xy) — C(x) 



NCD(x,y) 



C(y) 



That is, the distance NCD(x,y) between x and y is the improvement due to compressing y using x as previously 
compressed "data base," and compressing y from scratch, expressed as the ratio between the bit-wise length of 
the two compressed versions. Relative to the reference compressor we can define the information in x about y as 
C(y)-C(y\x). Then, 

That is, the NCD between x and y is 1 minus the ratio of the information x about y and the information in y. 

Theorem 3.6 If the compressor is normal, then the NCD is a similarity metric. 

PROOF. (Metricity:) We first show that the NCD satisfies the three metric (in)equalities up to 0((logn)/«) 
additive precision, where n is the maximal binary length of an opject involved in the (in)equality concerned. 

(i) By idempotency we have NCD(x,x) = 0. By monotonicity we have NCD(x,y) > for every x,y. 

(ii) NCD(x,y) = NCD(y,x). The NCD is unchanged by interchanging x and y in ( 13. U . 

(iii) The difficult property is the triangle inequality. Without loss of generality we assume C(x) < C(y) < 
C(z). Since the NCD is symmetrical, there are only three triangle inequalities that can be expressed by 
NCD(x,y),NCD(x,z),NCD(y,z). We verify them in turn: 

Proof of NCD (jc,y) < NCD(x,z) + NCD(z,y): By distributivity, the compressor itself satisfies C(xy) + C(z) < 
C{xz) +C(zy). Subtracting C(x) from both sides and rewriting, C(xy) —C(x) < C{xz)—C(x) +C(zy)~C(z). Dividing 
by C(y) on both sides we find 

C{xy)-C{x) C(xz)-C(x)+C(zy)-C(z) 
C(y) " C(y) 

The left-hand side is < 1 . 

Case a: Assume the right-hand side is < 1. Setting C(z) = C(y) + A, and adding A to both the numerator and 
denominator of the right-hand side, it can only increase and draw closer to 1 . Therefore, 

C{xy)-C{x) C(xz) - C(x) + C(zy) - C(z) + A 
C(y) " C(y)+A 

_ C(zx)-C(x) C(zy)-C{y) 
C(z) C(z) ' 

which was what we had to prove. 

Case b: Assume the right-hand side is > 1. We proceed like in Case 2.1, and add A to both numerator and 
denominator. Although now the right-hand side decreases, it must still be greater than 1, and therefore the right- 
hand side remains at least as large as the left-hand side. 

Proof of NCD(x,z) < NCD(x,y) +NCD(y,z): By distributivity we have C(xz)+C(y) < C(xy)+C(yz). Adding 
C(x) to both sides, and dividing both sides by C(z) we obtain 

C(xz)-C{x) < C(xy)-C(x) C{yz)~C(y) 
C(z) ~ C(z) + C(z) ■ 

The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term, since 
C(y) < C(z). Therefore, the inequality stays valid under this substitution, which was what we had to prove. 
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Proof of NCD(y,z) < NCD(y,x) + NCD(x,z): By distributivity we have C(yz)+C(x) < C(yx)+C(xz). Adding 
C(y) to both sides, and dividing both sides by C(z) we obtain 



The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term, since 
C(y) < C(z). Therefore, the inequality stays valid under this substitution, which was what we had to prove. 

(Range:) We next verify the range of the NCD: The NCD(x,y) is always in between and 1, up to an additive 
O ( (log n)/n) term with n the maximum binary length of x,y. 

(Density:) It remains to show that the NCD satisfies (12. 2> . Since C(y) < C(x), the set condition NCD(x,y) < e 
can be rewritten (using symmetry) as C(xy) — C{y) < eC(x). Using C(y) < C(x) again, we also have C(xy) — C(x) < 
C(xy) —C(y). Together, 



Quasi-Universality: We now digress to the theory developed in 1291 . which formed the motivation for devel- 
oping the NCD. If, instead of the result of some real compressor, we substitute the Kolmogorov complexity for the 
lengths of the compressed files in the NCD formula, the result is a similarity metric. Let us call it the Kolmogorov 
metric. It is universal in the following sense: Every metric expressing similarity according to some feature, that can 
be computed from the objects concerned, is comprised (in the sense of minorized) by the universal metric. Note that 
every feature of the data gives rise to a similarity, and, conversely, every similarity can be thought of as expressing 
some feature: being similar in that sense. Our actual practice in using the NCD falls short of this ideal theory in at 
least three respects: 

(i) The claimed universality of the Kolmogorov metric holds only for indefinitely long sequences x,y. Once 
we consider strings x,y of definite length n, the Kolmogorov metric is only universal with respect to "simple" 
computable normalized information distances, where "simple" means that they are computable by programs of 
length, say, logarithmic in n. This reflects the fact that, technically speaking, the universality is achieved by summing 
the weighted contribution of all similarity metrics in the class considered with respect to the objects considered. Only 
similarity metrics of which the complexity is small (which means that the weight is large) with respect to the size of 
the data concerned kick in. 

(ii) The Kolmogorov complexity is not computable, and it is in principle impossible to compute how far off the 
NCD is from the Kolmogorov metric. So we cannot in general know how well we are doing using the NCD. 

(iii) To approximate the NCD we use standard compression programs like gzip, PPMZ, and bzip2. While better 
compression of a string will always approximate the Kolmogorov complexity better, this may not be true for the 
NCD. Due to its arithmetic form, subtraction and division, it is theoretically possible that while all items in the 
formula get better compressed, the improvement is not the same for all items, and the NCD value moves away from 
the asymptotic value. In our experiments we have not observed this behavior in a noticable fashion. Formally, we 
can state the following: 

Theorem 3.7 Let d be a computable similarity metric. Given a constant a > 0, let objects x,y be such that C{xy) — 
K(xy) <a. Then, NCD(x,y) < d(x,y) + (a + 0(1)) /k where k = max{C(x),C(y)}. 

PROOF. Fix d,C,x,y,k in the statement of the theorem. Without loss of generality we assume d(x,y) = e and 
C(x) = k. By d2~2l . there are < 2 {l+ ^ k+l many (u,v) pairs, such that C(u) < C(v) = k and d(u,v) < e. Since both C 
and d are computable functions, we can compute and enumerate all these pairs (u, v). The initially fixed pair (x,y) is 
an element in the list and its index takes < (1 +e)k+ 1 bits. It can be described by at most (1 +e)k + 0(l) bits — its 
index in the list and an 0(1) term accounting for the lengths of the programs involved in reconstructing (x,y) given 
its index in the list, including algorithms to compute functions d and C. Since the Kolmogorov complexity gives the 
length of the shortest effective description, we have K(xy) < ( 1 + e)k + 0(1). Then, 



C(yz)-C(y) C(yx)-C(x) C(yz)-C(y) 
C(z) ~ C(z) + C(z) 




Suppose that for some e there are > 2( 1+e ) c M +1 distinct xy satisfying (13. 7> . Since there are at most E-_o 2' = 
2(i+e)c(.v)+i _ ^ programs to encode the concatenations xy, by the pigeon hole principle, two distinct ones share the 
same code word. This contradicts the unique decompression property. □ 



NCD(x,y) 



C(xy) - C(x) K(xy) - C(x) + a 
C(x) " C{x) 



< 



eC(x)+a + 0(l) 
CM 



<e + 



g + 0(l) 



□ 
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So the NCD is quasi-universal in the sense that if for objects x,y the compression complexity C(xy) approximates 
the Kolmogorov complexity K(xy) up to much closer precision than C(x), then the NCD(x,y) minorizes the metric 
d(x,y) up to a vanishing term. More precisely, the error term (a + 0(l))/C(x) — > for a sequence of pairs x,y with 

C{y) < C(x), C{xy) -K(xy) < a and C(x) -> °°. 

Remark 3.8 Clustering according to NCD will group sequences together that are similar according to features that 
are not explicitly known to us. Analysis of what the compressor actually does, still may not tell us which features that 
make sense to us can be expressed by conglomerates of features analyzed by the compressor. This can be exploited 
to track down unknown features implicitly: forming automatically clusters of data and see in which cluster (if any) 
a new candidate is placed. 

Another aspect that can be exploited is exploratory: Given that the NCD is small for a pair x,y of specific se- 
quences, what does this really say about the sense in which these two sequences are similar? The above analysis 
suggests that close similarity will be due to a dominating feature (that perhaps expresses a conglomerate of subfea- 
tures). Looking into these deeper causes may give feedback about the appropriateness of the realized NCD distances 
and may help extract more intrinsic information about subject matter than the oblivious division into clusters by 
looking for the common features in the data clusters. 

4 Clustering 

Given a set of objects, the pairwise NCD's form the entries of a distance matrix. This distance matrix contains the 
pairwise relations in raw form. But in this format that information is not easily usable. Just as the distance matrix 
is a reduced form of information representing the original data set, we now need to reduce the information even 
further in order to achieve a cognitively acceptable format like data clusters. To extract a hierarchy of clusters from 
the distance matrix, we determine a dendrogram (binary tree) that agrees with the distance matrix according to a 
cost measure. This allows us to extract more information from the data than just flat clustering (determining disjoint 
clusters in dimensional representation). 

Clusters are groups of objects that are similar according to our metric. There are various ways to cluster. Our aim 
is to analyze data sets for which the number of clusters is not known a priori, and the data are not labeled. As stated in 
1 15 1, conceptually simple, hierarchical clustering is among the best known unsupervised methods in this setting, and 
the most natural way is to represent the relations in the form of a dendrogram, which is customarily a directed binary 
tree or undirected ternary tree. To construct the tree from a distance matrix with entries consisting of the pairwise 
distances between objects, we use a quartet method. This is a matter of choice only, other methods may work equally 
well. The distances we compute in our experiments are often within the range 0.85 to 1.2. That is, the distinguishing 
features are small, and we need a sensitive method to extract as much information contained in the distance matrix 
as is possible. For example, our experiments showed that reconstructing a minimum spanning tree is not sensitive 
enough and gives poor results. With increasing number of data items, the projection of the NCD matrix information 
into the tree representation format gets increasingly distorted. A similar situation arises in using alignment cost in 
genomic comparisons. Experience shows that in both cases the hierarchical clustering methods seem to work best 
for small sets of data, up to 25 items, and to deteriorate for larger sets, say 40 items or more. A standard solution 
to hierarchically cluster larger sets of data is to first cluster nonhierarchically, by say multidimensional scaling of 
fc-means, available in standard packages, for instance Matlab, and then apply hierarchical clustering on the emerging 
clusters. 

The quartet method: We consider every group of four elements from our set of n elements; there are such 
groups. From each group u, v,w,x we construct a tree of arity 3, which implies that the tree consists of two subtrees 
of two leaves each. Let us call such a tree a quartet topology. There are three possibilities denoted (i) uv\wx, (ii) 
mw|vx, and (iii) ux\vw, where a vertical bar divides the two pairs of leaf nodes into two disjoint subtrees (Figure^. 

For any given tree T and any group of four leaf labels u,v,w,x, we say T is consistent with uv\wx if and only if 
the path from u to v does not cross the path from w to x. Note that exactly one of the three possible quartet topologies 
for any set of 4 labels must be consistent for any given tree. We may think of a large tree having many smaller 
quartet topolgies embedded within its structure. Commonly the goal in the quartet method is to find (or approximate 
as closely as possible) the tree that embeds the maximal number of consistent (possibly weighted) quartet topologies 
from a given set Q of quartet topologies 1 18 1 (Figure[2}. This is called the (weighted) Maximum Quartet Consistency 
(MQC) problem. 

We propose a new optimization problem: the Minimum Quartet Tree Cost (MQTC), as follows: The cost of a 
quartet topology is defined as the sum of the distances between each pair of neighbors; that is, C uv \ wx = d(u,v) + 
d(w,x). The total cost Cj of a tree T with a set N of leaves (external nodes of degree 1) is defined as Cj = 
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Figure 1: The three possible quartet topologies for the set of leaf labels u,v,w,x 




Figure 2: An example tree consistent with quartet topology uv\wx 

H{u.v,w,x}cn{Cuv\wx '■ T is consistent with mv|wx} — the sum of the costs of all its consistent quartet topologies. First, 
we generate a list of all possible quartet topolgies for all four-tuples of labels under consideration. For each group of 
three possible quartet topologies for a given set of four labels u, v, w,x, calculate a best (minimal) cost m(u,v, w,x) = 
mm{C uv \ wx ,C uw \ vxl C ux \ vn }, and a worst (maximal) cost M(u,v,w,x) = max{C uv \ wx ,C llw \ vx ,C ux \ vw }. Summing all best 
quartet toplogies yields the best (minimal) cost m = Y*{ u ,v,w,x}cn m(u,v,w,x) . Conversely, summing all worst quartet 
toplogies yields the worst (maximal) cost M = Y.{u,v,w.x}cn M(u,v,w,x). For some distance matrices, these minimal 
and maximal values can not be attained by actual trees; however, the score Cj of every tree T will lie between these 
two values. In order to be able to compare tree scores in a more uniform way, we now rescale the score linearly 
such that the worst score maps to 0, and the best score maps to 1, and term this the normalized tree benefit score 
S(T) = (M — Ct) I (M — m). Our goal is to find a full tree with a maximum value of S(T), which is to say, the lowest 
total cost. 

To express the notion of computational difficulty one uses the notion of "nondeterministic polynomial time 
(NP)". If a problem concerning n objects is NP-hard this means that the best known algorithm for this (and a wide 
class of significant problems) requires computation time exponential in n. That is, it is infeasible in practice. The 
MQC decision problem is the following: Given n objects, let T be a tree of which the n leaves are labeled by the 
objects, and let Qj be the set of quartet topologies embedded in T. Given a set of quartet topologies Q, and an 
integer k, the problem is to decide whether there is a binary tree T such that Qf] Qj > k. In 1 18 1 it is shown that the 
MQC decision problem is NP-hard. For every MQC decision problem one can define an MQTC problem that has 
the same solution: give the quartet topologies in Q cost and the other ones cost 1 . This way the MQC decision 
problem can be reduced to the MQTC decision problem, which shows also the latter to be NP-hard. Hence, it is 
infeasible in practice, but we can sometimes solve it, and always approximate it. (The reduction also shows that the 
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quartet problems reviewed in 1 1 8 1 , are subsumed by our problem. ) Adapting current methods in 1 6 1 to our MQTC 
optimization problem, results in far too computationally intensive calculations; they run many months or years on 
moderate-sized problems of 30 objects. Therefore, we have designed a simple, feasible, heuristic method for our 
problem based on randomization and hill-climbing. First, a random tree with 2n — 2 nodes is created, consisting 
of n leaf nodes (with 1 connecting edge) labeled with the names of the data items, and n — 2 non-leaf or internal 
nodes labeled with the lowercase letter "n" followed by a unique integer identifier. Each internal node has exactly 
three connecting edges. For this tree T, we calculate the total cost of all embedded quartet toplogies, and invert and 
scale this value to find S(T). A tree is consistent with precisely A of all quartet topologies, one for every quartet. A 
random tree may be consistent with about | of the best quartet topologies — but because of dependencies this figure 
is not precise. The initial random this tree is chosen as the currently best known tree, and is used as the basis for 
further searching. We define a simple mutation on a tree as one of the three possible transformations: 

1. A leaf swap, which consists of randomly choosing two leaf nodes and swapping them. 

2. A subtree swap, which consists of randomly choosing two internal nodes and swapping the subtrees rooted at 
those nodes. 

3. A subtree transfer, whereby a randomly chosen subtree (possibly a leaf) is detached and reattached in another 
place, maintaining arity invariants. 

Each of these simple mutations keeps the number of leaf nodes and internal nodes in the tree invariant; only the 
structure and placements change. Define a full mutation as a sequence of at least one but potentially many simple 
mutations, picked according to the following distribution. First we pick the number k of simple mutations that we 
will perform with probability 2~ k . For each such simple mutation, we choose one of the three types listed above with 
equal probability. Finally, for each of these simple mutations, we pick leaves or internal nodes, as necessary. Notice 
that trees which are close to the original tree (in terms of number of simple mutation steps in between) are examined 
often, while trees that are far away from the original tree will eventually be examined, but not very frequently. In 
order to search for a better tree, we simply apply a full mutation on T to arrive at T', and then calculate S(T'). If 
S(T') > S(T), then keep T ! as the new best tree. Otherwise, try a new different tree and repeat. If S(T') ever reaches 
1, then halt, outputting the best tree. Otherwise, run until it seems no better trees are being found in a reasonable 
amount of time, in which case the approximation is complete. 
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Figure 3: Progress of a 60-item data set experiment over time 

Note that if a tree is ever found such that S(T) = 1, then we can stop because we can be certain that this tree 
is optimal, as no tree could have a lower cost. In fact, this perfect tree result is achieved in our artificial tree 
reconstruction experiment (Section |4.H reliably in a few minutes. For real-world data, S(T) reaches a maximum 
somewhat less than 1, presumably reflecting distortion of the information in the distance matrix data by the best 
possible tree representation, as noted above, or indicating getting stuck in a local optimum or a search space too 
large to find the global optimum. On many typical problems of up to 40 objects this tree-search gives a tree with 
S(T) > 0.9 within half an hour. For large numbers of objects, tree scoring itself can be slow (as this takes order n 4 
computation steps), and the space of trees is also large, so the algorithm may slow down substantially. For larger 
experiments, we use a C++/Ruby implementation with MPI (Message Passing Interface, a common standard used 
on massively parallel computers) on a cluster of workstations in parallel to find trees more rapidly. We can consider 
the graph mapping the achieved S(T) score as a function of the number of trees examined. Progress occurs typically 
in a sigmoidal fashion towards a maximal value < 1, Figure|3] 



12 



4.1 Three controlled experiments 

With the natural data sets we use, one may have the preconception (or prejudice) that, say, music by Bach should 
be clustered together, music by Chopin should be clustered together, and so should music by rock stars. However, 
the preprocessed music files of a piece by Bach and a piece by Chopin, or the Beatles, may resemble one another 
more than two different pieces by Bach — by accident or indeed by design and copying. Thus, natural data sets may 
have ambiguous, conflicting, or counterintuitive outcomes. In other words, the experiments on natural data sets have 
the drawback of not having an objective clear "correct" answer that can function as a benchmark for assessing our 
experimental outcomes, but only intuitive or traditional preconceptions. We discuss three experiments that show that 
our program indeed does what it is supposed to do — at least in artificial situations where we know in advance what 
the correct answer is. The similarity machine consists of two parts: (i) extracting a distance matrix from the data, 
and (ii) constructing a tree from the distance matrix using our novel quartet-based heuristic. 




Figure 4: The randomly generated tree that our algorithm reconstructed. S(T) = 1. 



Testing the quartet-based tree construction: We first test whether the quartet-based tree construction heuristic 
is trustworthy: We generated a ternary tree T with 18 leaves, using the pseudo-random number generator "rand" 
of the Ruby programming language, and derived a metric from it by defining the distance between two nodes as 
follows: Given the length of the path from a to b, in an integer number of edges, as L(a,b), let 



d(a,b) — 



L{a,l 



1 



18 



except when a — b, in which case d(a,b) = 0. It is easy to verify that this simple formula always gives a number 
between and 1, and is monotonic with path length. Given only the 18 x 18 matrix of these normalized distances, 
our quartet method exactly reconstructed the original tree T represented in Figure^ with S(T) = 1. 

Testing the similarity machine on artificial data: Given that the tree reconstruction method is accurate on 
clean consistent data, we tried whether the full procedure works in an acceptable manner when we know what the 
outcome should be like. We used the "rand" pseudo-random number generator from the C programming language 
standard library under Linux. We randomly generated 11 separate 1 -kilobyte blocks of data where each byte was 
equally probable and called these tags. Each tag was associated with a different lowercase letter of the alphabet. 
Next, we generated 22 files of 80 kilobyte each, by starting with a block of purely random bytes and applying one, 
two, three, or four different tags on it. Applying a tag consists of ten repetitions of picking a random location in the 
80-kilobyte file, and overwriting that location with the globally consistent tag that is indicated. So, for instance, to 
create the file referred to in the diagram by "a," we start with 80 kilobytes of random data, then pick ten places to 
copy over this random data with the arbitrary 1 -kilobyte sequence identified as tag a. Similarly, to create file "ab," 
we start with 80 kilobytes of random data, then pick ten places to put copies of tag a, then pick ten more places to 
put copies of tag b (perhaps overwriting some of the a tags). Because we never use more than four different tags, and 
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Figure 5: Classification of artificial files with repeated 1-kilobyte tags. Not all possiblities are included; for example, 
file "b" is missing. S(T) = 0.905. 

therefore never place more than 40 copies of tags, we can expect that at least half of the data in each file is random 
and uncorrelated with the rest of the files. The rest of the file is correlated with other files that also contain tags in 
common; the more tags in common, the more related the files are. The compressor used to compute the NCD matrix 
was bzip2. The resulting tree is given in Figure|5J it can be seen that the clustering has occured exactly as we would 
expect. The S(T) score is 0.905. 

Testing the similarity machine on natural data: We test gross classification of files based on markedly different 
file types. Here, we chose several files: (i) Four mitochondrial gene sequences, from a black bear, polar bear, fox, and 
rat obtained from the GenBank Database on the world-wide web; (ii) Four excerpts from the novel The Zeppelin 's 
Passenger by E. Phillips Oppenheim, obtained from the Project Gutenberg Edition on the World-Wide web; (iii) 
Four MIDI files without further processing; two from Jimi Hendrix and two movements from Debussy's Suite 
Bergamasque, downloaded from various repositories on the world-wide web; (iv) Two Linux x86 ELF executables 
(the cp and rm commands), copied directly from the RedHat 9.0 Linux distribution; and (v) Two compiled Java 
class files, generated by ourselves. The compressor used to compute the NCD matrix was bzip2. As expected, the 
program correctly classifies each of the different types of files together with like near like. The result is reported in 
Figure|6]with S(T) equal to the very high confidence value 0.984. This experiment shows the power and universality 
of the method: no features of any specific domain of application are used. 

5 Experimental Validation 

We developed the CompLearn Toolkit, Section^ and performed experiments in vastly different application fields to 
test the quality and universality of the method. The success of the method as reported below depends strongly on 
the judicious use of encoding of the objects compared. Here one should use common sense on what a real world 
compressor can do. There are situations where our approach fails if applied in a straightforward way. For example: 
comparing text files by the same authors in different encodings (say, Unicode and 8-bit version) is bound to fail. 
For the ideal similarity metric based on Kolmogorov complexity as defined in [29| this does not matter at all, but 
for practical compressors used in the experiments it will be fatal. Similarly, in the music experiments below we use 
symbolic MIDI music file format rather than wave format music files. The reason is that the strings resulting from 
straightforward discretizing the wave form files may be too sensitive to how we discretize. 
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Figure 6: Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S(T) = 
0.984. 

5.1 Genomics and Phylogeny 

In recent years, as the complete genomes of various species become available, it has become possible to do whole 
genome phylogeny (this overcomes the problem that useing different targeted parts of the genome, or proteins, may 
give different trees |34|). Traditional phylogenetic methods on individual genes depended on multiple alignment of 
the related proteins and on the model of evolution of individual amino acids. Neither of these is practically applicable 
to the genome level. In absence of such models, a method which can compute the shared information between two 
sequences is useful because biological sequences encode information, and the occurrence of evolutionary events 
(such as insertions, deletions, point mutations, rearrangements, and inversions) separating two sequences sharing a 
common ancestor will result in the loss of their shared information. Our method (in the form of the CompLearn 
Toolkit) is a fully automated software tool based on such a distance to compare two genomes. 

Mammalian Evolution: In evolutionary biology the timing and origin of the major extant placental clades (groups 
of organisms that have evolved from a common ancestor) continues to fuel debate and research. Here, we provide 
evidence by whole mitochondrial genome phylogeny for competing hypotheses in two main questions: the grouping 
of the Eutherian orders, and the Therian hypothesis versus the Marsupionta hypothesis. 

Eutherian Orders: We demonstrate (already in |29 1) that a whole mitochondrial genome phylogeny of the Eu- 
therians (placental mammals) can be reconstructed automatically from unaligned complete mitochondrial genomes 
by use of an early form of our compression method, using standard software packages. As more genomic material 
has become available, the debate in biology has intensified concerning which two of the three main groups of placen- 
tal mammals are more closely related: Primates, Ferungulates, and Rodents. In |7 1, the maximum likelihood method 
of phylogeny tree reconstruction gave evidence for the (Ferungulates, (Primates, Rodents)) grouping for half of the 
proteins in the mitochondial genomes investigated, and (Rodents, (Ferungulates, Primates)) for the other halves of 
the mt genomes. In that experiment they aligned 12 concatenated mitochondrial proteins, taken from 20 species: rat 
(Rattus norvegicus), house mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal (Phoca vitulina), cat 
(Felis catus), white rhino (Ceratotherium simum), horse (Equus caballus), finback whale (Balaenoptera physalus), 
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Figure 7: The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD 
matrix of Figure [9| We have redrawn the tree from our output to agree better with the customary phylogeny tree 
format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996. 



Figure 8: Multidimensional clustering of same NCD matrix (Figure|9) as used for Figure0 Kruskal's stress-1 
0.389. 
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blue whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates lar), gorilla (Gorilla gorilla), human 
(Homo sapiens), chimpanzee (Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan (Pongo pygmaeus), 
Sumatran orangutan (Pongo pygmaeus abelii), using opossum (Didelphis virginiana), wallaroo (Macropus robus- 
tus), and the platypus (Ornithorhynchus anatinus) as outgroup. In 1 28 29 1 we used the whole mitochondrial genome 
of the same 20 species, computing the NCD distances (or a closely related distance in 1 28 1), using the GenCompress 
compressor, followed by tree reconstruction using the neighbor joining program in the MOLPHY package |36| to 
confirm the commonly believed morphology-supported hypothesis (Rodents, (Primates, Ferungulates)). Repeating 
the experiment using the hypercleaning method 1 6 1 of phylogeny tree reconstruction gave the same result. Here, we 
repeated this experiment several times using the CompLearn Toolkit using our new quartet method for reconstructing 
trees, and computing the NCD with various compressors (gzip, bzip2, PPMZ), again always with the same result. 
These experiments are not reported since they are subsumed by the larger experiment of Figure0 

Marsupionta and Theria: The extant monophyletic divisions of the class Mammalia are the Prototheria 
(monotremes: mammals that procreate using eggs), Metatheria (marsupials: mammals that procreate using pouches), 
and Eutheria (placental mammals: mammals that procreate using placentas). The sister relationships between these 
groups is viewed as the most fundamental question in mammalian evolution 1191 . Phylogenetic comparison by 
either anatomy or mitochondrial genome has resulted in two conflicting hypotheses: the gene-isolation-supported 
Marsupionta hypothesis: ((Prototheria, Metatheria), Eutheria) versus the morphology-supported Theria hypothesis: 
(Prototheria, (Methateria, Eutheria)), the third possiblity apparently not being held seriously by anyone. There has 
been a lot of support for either hypothesis; recent support for the Theria hypothesis was given in 1 19 1 by analyzing a 
large nuclear gene (M6P/IG2R), viewed as important across the species concerned, and even more recent support for 
the Marsupionta hypothesis was given in 1171 by phylogenetic analysis of another sequence from the nuclear gene 
(18S rRNA) and by the whole mitochondrial genome. 

Experimental Evidence: To test the Eutherian orders simultaneously with the Marsupionta- versus Theria 
hypothesis, we added four animals to the above twenty: Australian echidna (Tachyglossus aculeatus), brown bear 
(Ursus arctos), polar bear (Ursus maritimus), using the common carp (Cyprinus carpio) as the outgroup. Interest- 
ingly, while there are many species of Eutheria and Metatheria, there are only three species of now living Prototheria 
known: platypus, and two types of echidna (or spiny anteater). So our sample of the Prototheria is large. The addi- 
tion of the new species might be risky in that the addition of new relations is known to distort the previous phylogeny 
in traditional computational genomics practice. With our method, using the full genome and obtaining a single tree 
with a very high confidence S(T) value, that risk is not as great as in traditional methods obtaining ambiguous trees 
with bootstrap (statistic support) values on the edges. The mitochondrial genomes of the total of 24 species we used 
were downloaded from the GenBank Database on the world-wide web. Each is around 17,000 bases. The NCD 
distance matrix was computed using the compressor PPMZ. The resulting phylogeny, with an almost maximal S[T) 
score of 0.996 supports anew the currently accepted grouping (Rodents, (Primates, Ferungulates)) of the Eutherian 
orders, and additionally the Marsupionta hypothesis ((Prototheria, Metatheria), Eutheria), see Figure[7] Overall, our 
whole-mitochondrial NCD analysis supports the following hypothesis: 

Mammalia 

, A s 

( [primates, ferungulates) [rodents, [Metatheria , Prototheria)) ) , 
s , ' 

Eutheria 

which indicates that the rodents, and the branch leading to the Metatheria and Prototheria, split off early from the 
branch that led to the primates and ferungulates. Inspection of the distance matrix shows that the primates are very 
close together, as are the rodents, the Metatheria, and the Prototheria. These are tightly-knit groups with relatively 
close NCD's. The ferungulates are a much looser group with generally distant NCD's. The intergroup distances 
show that the Prototheria are furthest away from the other groups, followed by the Metatheria and the rodents. Also 
the fine-structure of the tree is consistent with biological wisdom. 

Hierarchical versus Flat Clustering: This is a good place to contrast the informativeness of hierarchical 
clustering with multidimensional clustering using the same NCD matrix, exhibited in Figure |9] The entries give a 
good example of typical NCD values; we truncated the number of decimals from 15 to 3 significant digits to save 
space. Note that the majority of distances bunches in the range [0.9, 1]. This is due to the regularities the compressor 
can perceive. The diagonal elements give the self-distance, which, for PPMZ, is not actually 0, but is off from 
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Figure 9: Distance matrix of pairwise NCD. For display purpose, we have truncated the original entries from 15 
decimals to 3 decimals precision. 



only in the third decimal. In Figure [8] we clustered the 24 animals using the NCD matrix by multidimenional 
scaling as points in 2-dimensional Euclidean space. In this method, the NCD matrix of 24 animals can be viewed 
as a set of distances between points in n-dimensional Euclidean space (n < 24), which we want to project into a 
2-dimensional Euclidean space, trying to distort the distances between the pairs as little as possible. This is akin to 
the problem of projecting the surface of the earth globe on a two-dimensional map with minimal distance distortion. 
The main feature is the choice of the measure of distortion to be minimized, 1 15 1. Let the original set of distances be 
d\, . . . ,dk and the projected distances be d[,... ,d' k . In Figure[8]we used the distortion measure Kruskall's stress-1, 

1221 . which minimizes {Y,i<k{di — d'j) 2 ) / 'Y*i<kdj '■ Kruskall's stress- 1 equal means no distortion, and the worst 
value is at most 1 (unless you have a really bad projection). In the projection of the NCD matrix according to 
our quartet method one minimizes the more subtle distortion S(T) measure, where 1 means perfect representation 
of the relative relations between every 4-tuple, and means minimal representation. Therefore, we should compare 
distortion Kruskall stress-1 with 1 — S(T). Figure0has a very good 1 - S(T) = 0.04 and Figure|8]has a poor Rruskal 
stress 0.389. Assuming that the comparison is significant for small values (close to perfect projection), we find that 
the multidimensional scaling of this experiment's NCD matrix is formally inferior to that of the quartet tree. This 
conclusion formally justifies the impression conveyed by the figures on visual inspection. 



SARS Virus: In another experiment we clustered the SARS virus after its sequenced genome was made publicly 
available, in relation to potential similar virii. The 15 virus genomes were downloaded from The Universal Virus 
Database of the International Committee on Taxonomy of Viruses, available on the world-wide web. The SARS 
virus was downloaded from Canada's Michael Smith Genome Sciences Centre which had the first public SARS 
Coronovirus draft whole genome assembly available for download (SARS TOR2 draft genome assembly 120403). 
The NCD distance matrix was computed using the compressor bzip2. The relations in Figure [K)] are very similar 
to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of 
Medicine, 1231 . We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since 
the former is more precise for visual inspection of proximity relations. 



Analysis of Mitochondrial Genomes of Fungi: As a pilot for applications of the CompLearn Toolkit in fungi ge- 
nomics reasearch, the group of T. Boekhout, E. Kuramae, V. Robert, of the Fungal Biodiversity Center, Royal Nether- 
lands Academy of Sciences, supplied us with the mitochondrial genomes of Candida glabrata, Pichia canadensis, 
Saccharomyces cerevisiae, S. castellii, S. servazzii, Yarrowia lipolytica (all yeasts), and two filamentous ascomycetes 
Hypocrea jecorina and Ve rticillium lecanii. The NCD distance matrix was computed using the compressor PPMZ. 
The resulting tree is depicted in Figure The interpretation of the fungi researchers is "the tree clearly clustered 
the ascomycetous yeasts versus the two filamentous Ascomycetes, thus supporting the current hypothesis on their 
classification (for example, see 1241 ). Interestingly, in a recent treatment of the Saccharomycetaceae, S. servazii, S. 
castellii and C. glabrata were all proposed to belong to genera different from Saccharomyces, and this is supported 
by the topology of our tree as well ([25 1)." 

To compare the veracity of the NCD clustering with a more feature-based clustering, we also calculated the 
pairwise distances as follows: Each file is converted to a 4096-dimensional vector by considering the frequency of 
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Figure 10: SARS virus among other virii. Legend: AvianAdenolCELO.inp: Fowl adenovirus 1; AvianlBl.inp: 
Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus 
(strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdenol.inp: Duck adenovirus 1; Hu- 
manAdeno40.inp: Human adenovirus type 40; HumanCoronal.inp: Human coronavirus 229E; MeaslesMora.inp: 
Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHepl l.inp: Murine hepati- 
tis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD l.inp: Enterobacteria phage PRD1; 
RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2vl20403; SIRVl.inp: Sulfolobus 
virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S(T) = 0.988. 
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Figure 1 1 : Dendrogram of mitochondrial genomes of fungi using NCD. This represents the distance matrix precisely 
with S(T) = 0.999. 
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Figure 12: Dendrogram of mitochondrial genomes of fungi using block frequencies. This represents the distance 
matrix precisely with S(T) = 0.999. 
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Figure 13: Clustering of Native-American, Native- African, and Native-European languages. S(T) = 0.928. 



all (overlapping) 6-byte contiguous blocks. The 12-distance (Euclidean distance) is calculated between each pair of 
files by taking the square root of the sum of the squares of the component-wise differences. These distances are 
arranged into a distance matrix and linearly scaled to fit the range [0, 1.0]. Finally, we ran the clustering routine on 
this distance matrix. The results are in Figure[2] As seen by comparing with the NCD-based Figure^Jthere are 
apparent misplacements when using the Euclidean distance in this way. Thus, in this simple experiment, the NCD 
performed better, that is, agreed more precisely with accepted biological knowledge. 

5.2 Language Trees 

Our method improves the results of (Q, using a linguistic corpus of "The Universal Declaration of Human Rights 
(UDoHR)" 1331 in 52 languages. Previously, 1 1 1 used an asymmetric measure based on relative entropy, and the full 
matrix of the pair-wise distances between all 52 languages, to build a language classification tree. This experiment 
was repeated (resulting in a somewhat better tree) using the compression method in |29| using standard biological 
software packages to construct the phylogeny. We have redone this experiment, and done new experiments, using the 
CompLearn Toolkit. Here, we report on an experiment to separate radically different language families. We down- 
loaded the language versions of the UDoHR text in English, Spanish, Dutch, German (Native-European), Pemba, 
Dendi, Ndbele, Kicongo, Somali, Rundi, Ditammari, Dagaare (Native African), Chikasaw, Perhupecha, Mazahua, 
Zapoteco (Native- American), and didn't preprocess them except for removing initial identifying information. We 
used an Lempel-Ziv-type compressor gzip to compress text sequences of sizes not exceeding the length of the sliding 
window gzip uses (32 kilobytes), and compute the NCD for each pair of language sequences. Subsequently we clus- 
tered the result. We show the outcome of one of the experiments in Figure ^] Note that three groups are correctly 
clustered, and that even the subclusters of the European languages are correct (English is grouped with the Romance 
languages because it contains up to 40% admixture of words from Latin origine). 

5.3 Literature 

The texts used in this experiment were down-loaded from the world-wide web in original Cyrillic-lettered Russian 
and in Latin-lettered English by L. Avanasiev (Moldavian MSc student at the University of Amsterdam). The com- 
pressor used to compute the NCD matrix was bzip2. We clustered Russian literature in the original (Cyrillic) by 
Gogol, Dostojevski, Tolstoy, Bulgakov, Tsjechov, with three or four different texts per author. Our purpose was to 
see whether the clustering is sensitive enough, and the authors distinctive enough, to result in clustering by author. 
In Figure[2]we see a perfect clustering. Considering the English translations of the same texts, in Figure^] we see 
errors in the clustering. Inspection shows that the clustering is now partially based on the translator. It appears that 
the translator superimposes his characteristics on the texts, partially suppressing the characteristics of the original 
authors. In other experiments we separated authors by gender and by period. 

5.4 Music 

The amount of digitized music available on the internet has grown dramatically in recent years, both in the public 
domain and on commercial sites. Napster and its clones are prime examples. Websites offering musical content 
in some form or other (MP3, MIDI, . . . ) need a way to organize their wealth of material; they need to somehow 
classify their files according to musical genres and subgenres, putting similar pieces together. The purpose of such 
organization is to enable users to navigate to pieces of music they already know and like, but also to give them 
advice and recommendations ("If you like this, you might also like. . . "). Currently, such organization is mostly 
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DostoevskiiCrime ■ 
DostoevskiiPoorfolk ■ 
DostoevskiiGmbl ■ 
Dostoevsktildiot ■ 
TurgenevRudin ■ 
TurgenevGentlefolks ■ 
TurgenevEve ■ 
TurgenevOtcydeti ■ 
Tolstoylunosti ■ 
TolstoyAnnak ■ 
TolstoyWaM ■ 
GogolPortrDvaiv ■ 
GogolMertvye ■ 
GogolDik ■ 
GogolTaras ■ 
TolstoyKasak • 
BulgakovMaster ■ 
BulgakovEggs 
BulgakovDghrt ■ 



Figure 14: Clustering of Russian writers. Legend: I.S. Turgenev, 1818-1883 [Father and Sons, Rudin, On the Eve, A 
House of Gentlefolk]; F. Dostoyevsky 1821-1881 [Crime and Punishment, The Gambler, The Idiot; Poor Folk]; L.N. 
Tolstoy 1828-1910 [Anna Karenina, The Cossacks, Youth, War and Piece]; N.V. Gogol 1809-1852 [Dead Souls, 
Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891-1940 [The Master and 
Margarita, The Fatefull Eggs, The Heart of a Dog]. S(T) = 0.949. 
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Figure 15: Clustering of Russian writers translated in English. The translator is given in brackets after the titles of 
the texts. Legend: I.S. Turgenev, 1818-1883 [Father and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve 
(Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821-1881 [Crime and Punishment 
(Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin); Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 
1828-1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and 
Piece (L. and M. Aylmer)]; N.V. Gogol 1809-1852 [Dead Souls (C.J. Hogarth), Taras Bulba (« G. Tolstoy, 1860, 
B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (« I.F. Hapgood]; M. Bulgakov 1891— 
1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fatefull Eggs (K. Gook-Horujy), The Heart of a 
Dog (M. Glenny)]. S(T) = 0.953. 



21 




Figure 16: Output for the 36 pieces from 3 music-genres. Legend: 12 Jazz: John Coltrane [Blue Trane, Giant Steps, 
Lazy Bird, Impressions]; Miles Davis [Milestones, Seven Steps to Heaven, Solar, So What]; George Gershwin 
[Summertime]; Dizzy Gillespie [Night in Tunisia]; Thelonious Monk [Round Midnight]; Charlie Parker [Yardbird 
Suite]; 12 Rock & Pop: The Beatles [Eleanor Rigby, Michelle]; Eric Clapton [Cocaine, Layla]; Dire Straits [Money 
for Nothing]; Led Zeppelin [Stairway to Heaven]; Metallica [One]; Jimi Hendrix [Hey Joe, Voodoo Chile]; The 
Police [Every Breath You Take, Message in a Bottle] Rush [Yyz]; 12 Classic: see Legend Figure S(T) =0.858. 

done manually by humans, but some recent research has been looking into the possibilities of automating music 
classification. 

Initially, we downloaded 36 separate MIDI (Musical Instrument Digital Interface, a versatile digital music for- 
mat available on the world-wide-web) files selected from a range of classical composers, as well as some popular 
music. The files were down-loaded from several different MIDI Databases on the world-wide web. The identifying 
information, composer, title, and so on, was stripped from the files (otherwise this may give a marginal advantage 
to identify composers to the compressor). Each of these files was run through a preprocessor to extract just MIDI 
Note-On and Note-Off events. These events were then converted to a player-piano style representation, with time 
quantized in 0.05 second intervals. All instrument indicators, MIDI control signals, and tempo variations were ig- 
nored. For each track in the MIDI file, we calculate two quantities: An average volume and a modal note. The 
average volume is calculated by averaging the volume (MIDI note velocity) of all notes in the track. The modal note 
is defined to be the note pitch that sounds most often in that track. If this is not unique, then the lowest such note 
is chosen. The modal note is used as a key-invariant reference point from which to represent all notes. It is denoted 
by 0, higher notes are denoted by positive numbers, and lower notes are denoted by negative numbers. A value of 
1 indicates a half-step above the modal note, and a value of —2 indicates a whole-step below the modal note. The 
tracks are sorted according to decreasing average volume, and then output in succession. For each track, we iterate 
through each time sample in order, outputting a single signed 8-bit value for each currently sounding note. Two 
special values are reserved to represent the end of a time step and the end of a track. This file is then used as input to 
the compression stage for distance matrix calculation and subsequent tree search. To check whether any important 
feature of the music was lost during preprocessing, we played it back from the preprocessed files to verify whether 
it sounded like the original. To the authors the pieces sounded almost unchanged. The compressor used to compute 
the NCD matrix of the genres tree, Figure^] and that of 12-piece music set, Figure^]is bzip2. For the full range 
of the music experiments see 1 8 1 . 
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Figure 17: Output for the 12-piece set. Legend: J.S. Bach [Wohltemperierte Klavier II: Preludes and Fugues 1,2 — 
BachWTK2{F,P}{l,2}]; Chopin [Preludes op. 28: 1, 15, 22, 24 — ChopPrel{l, 15,22,24}]; Debussy [Suite Berga- 
masque, 4 movements — DebusBerg{l,2,3,4}]. S(T) — 0.968. 

Before testing whether our program can see the distinctions between various classical composers, we first show 
that it can distinguish between three broader musical genres: classical music, rock, and jazz. This may be easier than 
making distinctions "within" classical music. All musical pieces we used are listed in the tables in the full paper (on 
the URL provided above). For the genre-experiment we used 12 classical pieces consisting of Bach, Chopin, and 
Debussy, 12 jazz pieces, and 12 rock pieces. The tree (Figure fTBT l that our program came up with has S(T) = 0.858. 
The discrimination between the 3 genres is reasonable but not perfect. Since S(T) — 0.858, a fairly low value, the 
resulting tree doesn't represent the NCD distance matrix very well. Presumably, the information in the NCD distance 
matrix cannot be represented by a dendrogram of high S(T) score. This appears to be a common problem with large 
(> 25 or so) natural data sets. Another reason may be that the program terminatedi, while trapped in a local optimum. 
We repeated the experiment many times with almost the same results, so that doesn't appear to be the case. The 11- 
item subtree rooted at n4 contains 10 of the 12 jazz pieces, together with a piece of Bach's "Wohltemporierte Klavier 
(WTK)". The other two jazz pieces, Miles Davis' "So What," and John Coltrane's "Giant Steps" are placed elsewhere 
in the tree, perhaps according to some kinship that now escapes us (but may be identified by closer studying of the 
objects concerned). Of the 12 rock pieces, 10 are placed in the 12-item subtree rooted at n29, together with a piece 
of Bach's "WTK," and Coltrane's "Giant Steps," while Hendrix's "Voodoo Chile" and Rush "Yyz" is further away. 
Of the 12 classical pieces, 10 are in the 13-item subtrees rooted at the branch n8,nl3,«6,«7, together with Hendrix's 
"Voodoo Chile," Rush's "Yyz," and Miles Davis' "So What." Surprisingly, 2 of the 4 Bach "WTK" pieces are placed 
elsewhere. Yet we perceive the 4 Bach pieces to be very close, both structurally and melodically (as they all come 
from the mono-thematic "Wohltemporierte Klavier"). But the program finds a reason that at this point is hidden from 
us. In fact, running this experiment with different compressors and termination conditions consistently displayed this 
anomaly. The small set encompasses the 4 movements from Debussy's "Suite Bergamasque," 4 movements of book 
2 of Bach's "Wohltemperierte Klavier," and 4 preludes from Chopin's "Opus 28." As one can see in Figure H71 
our program does a pretty good job at clustering these pieces. The S(T) score is also high: 0.968. The 4 Debussy 
movements form one cluster, as do the 4 Bach pieces. The only imperfection in the tree, judged by what one would 
intuitively expect, is that Chopin's Prelude no. 15 lies a bit closer to Bach than to the other 3 Chopin pieces. This 
Prelude no 15, in fact, consistently forms an odd-one-out in our other experiments as well. This is an example of 
pure data mining, since there is some musical truth to this, as no. 15 is perceived as by far the most eccentric among 
the 24 Preludes of Chopin's opus 28. 

5.5 Optical Character Recognition 

Can we also cluster two-dimensional images? Because our method appears focussed on strings this is not straightfor- 
ward. It turns out that scanning a picture in raster row-major order retains enough regularity in both dimensions for 
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Figure 18: Images of handwritten digits used for OCR. 




Figure 19: Clustering of the OCR images. S(T) = 0.901. 
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the compressor to grasp. A simple task along these lines is to cluster handwritten characters. The handwritten char- 
acters in Figure^]were downloaded from the NIST Special Data Base 19 (optical character recognition database) 
on the world-wide web. Each file in the data directory contains 1 digit image, either a four, five, or six. Each pixel is 
a single character; '#' for a black pixel, '.' for white. Newlines are added at the end of each line. Each character is 
128x128 pixels. The NCD matrix was computed using the compressor PPMZ. The Figure[T9lshows each character 
that is used. There are 10 of each digit "4," "5," "6," making a total of 30 items in this experiment. All but one of the 
4's are put in the subtree rooted at nl, all but one of the 5's are put in the subtree rooted at n4, and all 6's are put in the 
subtree rooted at n3. The remaining 4 and 5 are in the branch n23,«13 joining n6 and n3. So 28 items out of 30 are 
clustered correctly, that is, 93%. In this experiment we used only 3 digits. Using the full set of decimal digits results 
in a lower clustering accuracy. However, we can use the NCD as a oblivious feature-extraction technique to convert 
generic objects into finite-dimensional vectors. We have used this technique to train a support vector machine (S VM) 
based OCR system to classify handwritten digits by extracting 80 distinct, ordered NCD features from each input 
image. In this initial stage of ongoing research, by our oblivious method of compression-based clustering to supply 
a kernel for an SVM classifier, we achieved a handwritten single decimal digit recognition accuracy of 85%. The 
current state-of-the-art for this problem, after half a century of interactive feature-driven classification research, in 
the upper ninety % level B32II14I . All experiments are bench marked on the standard NIST Special Data Base 19 
(optical character recognition database). 

5.6 Astronomy 

As a proof of principle we clustered data from unknown objects, for example objects that are far away. In 1 3 1 obser- 
vations of the microquasar GRS 1915+105 made with the Rossi X-ray Timing Eplorer were analyzed. The interest 
in this microquasar stems from the fact that it was the first Galactic object to show a certain behavior (superluminal 
expansion in radio observations). Photonometric observation data from X-ray telescopes were divided into short 
time segments (usually in the order of one minute), and these segments have been classified into a bewildering array 
of fifteen different modes after considerable effort. Briefly, spectrum hardness ratios (roughly, "color") and photon 
count sequences were used to classify a given interval into categories of variability modes. From this analysis, the 
extremely complex variability of this source was reduced to transitions between three basic states, which, interpreted 
in astronomical terms, gives rise to an explanation of this peculiar source in standard black-hole theory. The data 
we used in this experiment made available to us by M. Klein Wolt (co-author of the above paper) and T. Maccarone, 
both researchers at the Astronomical Institute 'Anton Pannekoek", University of Amsterdam. The observations are 
essentially time series, and our aim was experimenting with our method as a pilot to more extensive joint research. 
Here the task was to see whether the clustering would agree with the classification above. The NCD matrix was 
computed using the compressor PPMZ. The results are in Figure |20] We clustered 12 objects, consisting of three 
intervals from four different categories denoted as 8,y,(j),9 in Table 1 of 0. In Figure[20]we denote the categories 
by the corresponding Roman letters D, G, P, and T, respectively. The resulting tree groups these different modes to- 
gether in a way that is consistent with the classification by experts for these observations. The oblivious compression 
clustering corresponds precisely with the laborious feature-driven classification in J3J. 

6 Conclusion 

To interpret what the NCD is doing, and to explain its remarkable accuracy and robustness across application fields 
and compressors, the intuition is that the NCD minorizes all similarity metrics based on features that are captured by 
the reference compressor involved. Such features must be relatively simple in the sense that they are expressed by 
an aspect that the compressor analyzes (for example frequencies, matches, repeats). Certain sophisticated features 
may well be expressible as combinations of such simple features, and are therefore themselves simple features in 
this sense. The extensive experimenting above shows that even elusive features are captured. 

A potential application of our non-feature (or rather, many-unknown-feature) approach is exploratory. Presented 
with data for which the features are as yet unknown, certain dominant features governing similarity are automatically 
discovered by the NCD. Examining the data underlying the clusters may yield this hitherto unknown dominant 
feature. 

Our experiments indicate that the NCD has application in two new areas of support vector machine (SVM) 
based learning. Firstly, we find that the inverted NCD (1-NCD) is useful as a kernel for generic objects in SVM 
learning. Secondly, we can use the normal NCD as a feature-extraction technique to convert generic objects into 
finite-dimensional vectors, see the last paragraph of Section l5~5l In effect our similarity engine aims at the ideal of 
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Figure 20: 16 observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class 
corresponding to Greek lower case letters in |3|. The remaining letters and digits identify the particular observation 
interval in terms of finer features and identity. The T-cluster is top left, the P-cluster is bottom left, the G-cluster is 
to the right, and the D-cluster in the middle. This tree almost exactly represents the underlying NCD distance matrix: 
S(T)= 0.994. 

a perfect data mining process, discovering unknown features in which the data can be similar. This is the subject of 
current joint research in genomics of fungi, clinical molecular genetics, and radio-astronomy. 
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